Load and preview the data:
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
There are 13 variables in our dataset. The first one, X, is a unique row ID. Then there are 11 independent feature variables and one output variable, quality. Quality is an integer score on a scale of 0 to 10; in our data, it ranges from 3 to 9. The median quality is 6 and the mean is only slightly lower at 5.88.
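The code that produced the preview above is not shown. A minimal sketch of an equivalent load step might look like the following. The helper name `load_wine` and the temporary demo file are assumptions made for illustration; the three demo rows reuse values from the `str()` output above.

```r
# Hypothetical helper: read the CSV and check that the target column exists.
load_wine <- function(path) {
  wine <- read.csv(path)
  stopifnot("quality" %in% names(wine))
  wine
}

# Demo on a tiny temporary file (first three rows of three columns, taken
# from the str() output above). In practice, pass the path to the real CSV.
tmp <- tempfile(fileext = ".csv")
writeLines(c("X,alcohol,quality",
             "1,8.8,6",
             "2,9.5,6",
             "3,10.1,6"), tmp)
wine_demo <- load_wine(tmp)

names(wine_demo)    # column names
summary(wine_demo)  # quartiles and mean for each column
str(wine_demo)      # column types and first values
```

On the full file, the same three calls should reproduce the kind of output listed above.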
Let’s see a histogram of wines by quality:
We can see that the largest group of wines (about 45%) falls into the bin with quality equal to 6, which is average quality. The distribution is roughly bell-shaped (close to normal) with a slight right skew.
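As a rough check of the shape described here, the quality column can be rebuilt from the per-score counts reported in this analysis and summarised in base R. This is only a sketch: the original plot was probably drawn with a different plotting call, and the `skewness` helper is a hand-rolled moment estimate, not a function from the original code.

```r
# Quality scores rebuilt from the per-score counts reported in this analysis.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Quality is discrete, so a bar per score is equivalent to a histogram
# with a bin width of 1.
barplot(table(quality), xlab = "quality", ylab = "number of wines")

# Moment-based skewness: positive values indicate a longer right tail.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
quality_skew <- skewness(quality)
quality_skew
```

A small positive value is consistent with a slight right skew.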
Fixed acidity:
Volatile acidity:
Citric acid:
Residual sugar:
Chlorides:
Free sulfur dioxide:
Total sulfur dioxide:
Density:
pH:
Sulphates:
Alcohol:
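Most of the variables look roughly normally distributed, although several (for example residual sugar, chlorides and free sulfur dioxide) have long right tails, with maxima far above their third quartiles. A lot of residual sugar values fall into the same bin between 1 and 2. There are a few outliers in bin 65-66. Other plots that have distinct outliers are free sulfur dioxide and density.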
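The outliers mentioned here can be cross-checked against the summary statistics printed earlier. The sketch below applies the common 1.5 x IQR fence rule, which is an assumption on my part and not necessarily how the outliers were identified in the plots. The quartiles and maxima are copied from the `summary()` output.

```r
# Quartiles and maxima copied from the summary() output above.
quartiles <- data.frame(
  variable = c("residual.sugar", "free.sulfur.dioxide", "density"),
  q1  = c(1.7, 23, 0.9917),
  q3  = c(9.9, 46, 0.9961),
  max = c(65.8, 289, 1.0390)
)

# Upper fence: third quartile plus 1.5 times the interquartile range.
quartiles$upper_fence <- with(quartiles, q3 + 1.5 * (q3 - q1))

# TRUE when the largest observed value lies beyond the fence.
quartiles$max_is_outlier <- quartiles$max > quartiles$upper_fence
quartiles
```

For all three variables the maximum lies well above the fence, which agrees with what the histograms show.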
Let’s see a table of wine counts by quality.
## Source: local data frame [7 x 3]
##
## quality n percent
## (int) (int) (dbl)
## 1 3 20 0.41
## 2 4 163 3.33
## 3 5 1457 29.75
## 4 6 2198 44.88
## 5 7 880 17.97
## 6 8 175 3.57
## 7 9 5 0.10
There are 4898 wines in the white wine dataset. All of them are variants of the Portuguese “Vinho Verde” wine. There are 11 input variables and one output variable (quality). Quality can be converted to an ordered factor with levels on a scale of 0 to 10. The largest group of wines falls into the average quality (6); the grouped table above demonstrates that. All other variables are numeric, mostly concentrations of chemical compounds in g / dm^3, except alcohol, which is measured in % by volume, and pH, which is measured on a scale of 0 to 14 (solutions with a pH less than 7 are acidic and solutions with a pH greater than 7 are basic). We are interested in how our input variables affect our output variable.
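The grouped table can be reproduced in base R. The table above looks like dplyr output, so the sketch below is an equivalent rather than the original code; the quality vector is rebuilt from the counts in that table.

```r
# Quality scores rebuilt from the counts in the table above.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Count wines per score and express each count as a percentage of the total.
counts <- as.data.frame(table(quality = quality), stringsAsFactors = FALSE)
names(counts)[2] <- "n"
counts$quality <- as.integer(counts$quality)
counts$percent <- round(100 * counts$n / sum(counts$n), 2)
counts

# Quality as an ordered factor. Only the observed scores are used as levels
# here; unused scores could be added through the `levels` argument.
quality_f <- factor(quality, levels = sort(unique(quality)), ordered = TRUE)
levels(quality_f)
```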
Correlation matrix:
## [1] 4898 13
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.02269729 0.289180698
## volatile.acidity -0.02269729 1.00000000 -0.149471811
## citric.acid 0.28918070 -0.14947181 1.000000000
## residual.sugar 0.08902070 0.06428606 0.094211624
## chlorides 0.02308564 0.07051157 0.114364448
## free.sulfur.dioxide -0.04939586 -0.09701194 0.094077221
## total.sulfur.dioxide 0.09106976 0.08926050 0.121130798
## density 0.26533101 0.02711385 0.149502571
## pH -0.42585829 -0.03191537 -0.163748211
## sulphates -0.01714299 -0.03572815 0.062330940
## alcohol -0.12088112 0.06771794 -0.075728730
## quality -0.11366283 -0.19472297 -0.009209091
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.08902070 0.02308564 -0.0493958591
## volatile.acidity 0.06428606 0.07051157 -0.0970119393
## citric.acid 0.09421162 0.11436445 0.0940772210
## residual.sugar 1.00000000 0.08868454 0.2990983537
## chlorides 0.08868454 1.00000000 0.1013923521
## free.sulfur.dioxide 0.29909835 0.10139235 1.0000000000
## total.sulfur.dioxide 0.40143931 0.19891030 0.6155009650
## density 0.83896645 0.25721132 0.2942104109
## pH -0.19413345 -0.09043946 -0.0006177961
## sulphates -0.02666437 0.01676288 0.0592172458
## alcohol -0.45063122 -0.36018871 -0.2501039415
## quality -0.09757683 -0.20993441 0.0081580671
## total.sulfur.dioxide density pH
## fixed.acidity 0.091069756 0.26533101 -0.4258582910
## volatile.acidity 0.089260504 0.02711385 -0.0319153683
## citric.acid 0.121130798 0.14950257 -0.1637482114
## residual.sugar 0.401439311 0.83896645 -0.1941334540
## chlorides 0.198910300 0.25721132 -0.0904394560
## free.sulfur.dioxide 0.615500965 0.29421041 -0.0006177961
## total.sulfur.dioxide 1.000000000 0.52988132 0.0023209718
## density 0.529881324 1.00000000 -0.0935914935
## pH 0.002320972 -0.09359149 1.0000000000
## sulphates 0.134562367 0.07449315 0.1559514973
## alcohol -0.448892102 -0.78013762 0.1214320987
## quality -0.174737218 -0.30712331 0.0994272457
## sulphates alcohol quality
## fixed.acidity -0.01714299 -0.12088112 -0.113662831
## volatile.acidity -0.03572815 0.06771794 -0.194722969
## citric.acid 0.06233094 -0.07572873 -0.009209091
## residual.sugar -0.02666437 -0.45063122 -0.097576829
## chlorides 0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide 0.05921725 -0.25010394 0.008158067
## total.sulfur.dioxide 0.13456237 -0.44889210 -0.174737218
## density 0.07449315 -0.78013762 -0.307123313
## pH 0.15595150 0.12143210 0.099427246
## sulphates 1.00000000 -0.01743277 0.053677877
## alcohol -0.01743277 1.00000000 0.435574715
## quality 0.05367788 0.43557472 1.000000000
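Pairs that show moderate linear correlation (absolute value of 0.3 and up, but below 0.7) are: fixed acidity and pH (-0.43), residual sugar and total sulfur dioxide (0.40), residual sugar and alcohol (-0.45), chlorides and alcohol (-0.36), free and total sulfur dioxide (0.62), total sulfur dioxide and density (0.53), total sulfur dioxide and alcohol (-0.45), density and quality (-0.31), alcohol and quality (0.44).
Pairs that show strong linear correlation (absolute value of 0.7 and up) are: residual sugar and density (0.84), density and alcohol (-0.78).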
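One way to list such pairs programmatically is to filter the upper triangle of the correlation matrix by absolute value. The helper `correlated_pairs` below is a hypothetical name, and the demo uses only a four-variable slice of the matrix printed above; on the full data, `cm` would come from `cor()` on the data frame.

```r
# Four-variable slice of the correlation matrix printed above.
vars <- c("residual.sugar", "density", "alcohol", "quality")
cm <- diag(length(vars))
dimnames(cm) <- list(vars, vars)
# upper.tri() fills column by column: (1,2), (1,3), (2,3), (1,4), (2,4), (3,4).
cm[upper.tri(cm)] <- c(0.83896645, -0.45063122, -0.78013762,
                       -0.09757683, -0.30712331, 0.43557472)
cm[lower.tri(cm)] <- t(cm)[lower.tri(cm)]  # mirror to keep the matrix symmetric

# Hypothetical helper: pairs whose |r| falls in [lo, hi).
correlated_pairs <- function(cm, lo, hi = Inf) {
  keep <- upper.tri(cm) & abs(cm) >= lo & abs(cm) < hi
  idx <- which(keep, arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, "row"]],
             var2 = colnames(cm)[idx[, "col"]],
             r = cm[idx],
             stringsAsFactors = FALSE)
}

moderate <- correlated_pairs(cm, 0.3, 0.7)
strong <- correlated_pairs(cm, 0.7)
moderate
strong
```

Using the absolute value means that strong negative pairs such as density and alcohol are picked up as well.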
Correlation plot (ggpairs):
Fixed acidity and volatile acidity:
Acidity seems to vary a lot, but faceting by quality, we can see that higher quality wines have acidities concentrated in a much smaller range. For fixed acidity, it falls into a range between 6 and 9, and for volatile acidity, it’s below 0.5.
Acidity vs pH:
There is a weak linear dependency between the total of acidities and pH, which makes sense: a more acidic wine has a lower pH.
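The "total of acidities" is taken here to be the sum of fixed acidity, volatile acidity and citric acid. That is an assumption about how the plot was built. A sketch using the first ten rows shown in the `str()` output:

```r
# First ten rows of the relevant columns, copied from the str() output above.
acid <- data.frame(
  fixed.acidity    = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.3, 0.22),
  citric.acid      = c(0.36, 0.34, 0.4, 0.32, 0.32, 0.4, 0.16, 0.36, 0.34, 0.43),
  pH               = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22)
)

# Assumed definition of total acidity: the sum of the three acid columns.
acid$total.acidity <- with(acid, fixed.acidity + volatile.acidity + citric.acid)

# Pearson correlation between total acidity and pH. Ten rows are far too few
# to be meaningful; run this on the full data frame for the real coefficient.
cor(acid$total.acidity, acid$pH)
```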
Citric acid and residual sugar:
Here again, we see a very wide variation of the two factors, but just like before, for higher quality wines (7, 8, 9) the spread is not quite so big. Citric acid mostly clumps within the range of 0.3-0.45, and residual sugar in the range of 0-12.
Residual.sugar vs alcohol:
Free and total sulfur dioxide:
Free SO2 and total SO2 have a dependency that is close to linear, which is expected, since free SO2 is a part of total SO2.
We see that in good wines (blues and greens), free SO2 should not be too low (mostly between 10-50), but then total SO2 should not be too high (mostly falls below 175).
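The ranges read off the plot can be turned into a simple filter. The function name `in_good_range` and its default thresholds (free SO2 between 10 and 50, total SO2 up to 175) are taken from the description above as illustrative assumptions, not fitted values. The demo rows are the first ten shown in the `str()` output.

```r
# First ten rows of the two SO2 columns, copied from the str() output above.
so2 <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# Hypothetical helper: TRUE when both SO2 values fall inside the ranges
# that the plot associates with better wines.
in_good_range <- function(free, total,
                          free_min = 10, free_max = 50, total_max = 175) {
  free >= free_min & free <= free_max & total <= total_max
}

so2$in_range <- in_good_range(so2$free.sulfur.dioxide, so2$total.sulfur.dioxide)

# Share of total SO2 that is free SO2.
so2$free_share <- so2$free.sulfur.dioxide / so2$total.sulfur.dioxide
so2
```

On the full data, tabulating `in_range` against quality would show how well these ranges separate the better wines.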
Total sulfur dioxide vs alcohol:
We can see a moderate negative linear dependency (the correlation coefficient is -0.45).
Density and pH:
We are faceting by quality here, and it does not look like density plays any role in quality: the range is pretty much the same on all grid facets. It is curious, because we did see a correlation between density and quality in our matrix (it was only -0.31, but stronger than many others). As for pH, however, the spread again narrows (3.0-3.6) with better quality wines.
Density vs quality:
And yet, when we plot density vs quality, the linear relationship presents itself.
Density vs residual.sugar:
We can clearly see the strong linear dependency that was pointed out by the correlation coefficient.
Density vs alcohol:
Here, the correlation coefficient is negative (-0.78) and almost as strong in magnitude as in the previous plot (0.84).
We are combining chlorides, sulphates and alcohol here, and faceting by quality again. The alcohol pattern is not very clear, but it seems that wines of the highest quality (9) are rather heavy on alcohol content (10-13%), and a lot of the lighter wines (8-9% alcohol) fall into quality 3-5. The best wines come with lower chlorides (0.01 - 0.07) and moderate sulphates (0.1-0.9).
If we plot alcohol content on a histogram and color it by quality, it confirms the same theory: higher alcohol content correlates with higher quality.
This dataset is quite challenging because, even though the input variables apparently affect the target variable somehow, it’s hard to pinpoint a descriptive function. There is no linear, exponential, square root, or other discernible dependency between any single input variable and the output variable. We can, however, draw some conclusions about the ranges of our input variables that seem to correlate with “better” values of our output variable. I have chosen some of the plots to demonstrate what I mean. The first one is acidity:
While acidity varies a lot within each of the facets, it does not vary quite so much within the facets marked 7, 8 and 9. So we can make an assumption that wines that fall within this narrower range of acidity are more likely to be evaluated as high quality. These ranges are 6-9 for fixed acidity and 0.01-0.5 for volatile acidity.
The next plot is sulfur dioxide, colored by quality. Again, good quality corresponds to a narrower range of the input variables: 10-50 for free SO2 and 50-175 for total SO2.
Looking at chlorides, sulphates and alcohol faceted by quality, we can see the link between higher alcohol content (10-13) and higher quality. Chlorides should be kept low (0.01 - 0.07) and sulphates moderate (0.1-0.9).
A few of the variables were strongly correlated. This may mean that if we model quality based on the other factors, we don’t have to include all of them in the model. For example, we could include only total sulfur dioxide and not free sulfur dioxide, or include alcohol but not density, which is highly correlated with alcohol:
In this dataset, we have 11 input factors that potentially affect the output variable. Most of them are roughly normally distributed. Some of them are correlated. It would be tempting to consolidate some, for example, roll up fixed acidity, volatile acidity and citric acid into “total.acidity”, and then, because that total is correlated with pH, exclude it from the model entirely. Potentially, we could also drop density, because it’s highly correlated with alcohol, or residual sugar, which is highly correlated with density.
However, I would be reluctant to exclude anything when modeling, so as not to lose important details. For example, from the dataset description it is clear that not all acids are created equal in terms of wine quality - while citric acid adds “flavor and freshness”, acetic acid leads to “vinegar taste”. I think that the only variable that we could somewhat safely drop is free sulfur dioxide, which is part of total sulfur dioxide.
Also, the only factors that show a visible linear correlation with quality are alcohol and density. We already know that density depends on alcohol, so dropping that, we seem to have only alcohol left as a “good” predictor of quality. But we know that there’s much more to wine than alcohol content, otherwise pure alcohol would be declared the best wine.
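If one did want to prune correlated inputs despite this caveat, a simple greedy rule is to repeatedly drop the variable that has the most highly correlated partners. The sketch below is one possible rule, not the method used in this analysis. `drop_correlated` is a hypothetical helper, and the demo uses a five-variable slice of the correlation matrix printed earlier.

```r
# Five-variable slice of the correlation matrix printed earlier.
vars <- c("residual.sugar", "free.sulfur.dioxide", "total.sulfur.dioxide",
          "density", "alcohol")
cm <- diag(length(vars))
dimnames(cm) <- list(vars, vars)
# upper.tri() fills column by column.
cm[upper.tri(cm)] <- c(0.29909835,
                       0.40143931, 0.61550097,
                       0.83896645, 0.29421041, 0.52988132,
                       -0.45063122, -0.25010394, -0.44889210, -0.78013762)
cm[lower.tri(cm)] <- t(cm)[lower.tri(cm)]

# Hypothetical helper: greedily drop the variable with the most partners
# whose |r| is at or above the cutoff. Ties go to the variable listed first.
drop_correlated <- function(cm, cutoff = 0.7) {
  high <- abs(cm) >= cutoff
  diag(high) <- FALSE
  dropped <- character(0)
  while (any(high)) {
    worst <- names(which.max(rowSums(high)))
    dropped <- c(dropped, worst)
    high[worst, ] <- FALSE
    high[, worst] <- FALSE
  }
  dropped
}

drop_correlated(cm, cutoff = 0.7)
drop_correlated(cm, cutoff = 0.6)
```

With a cutoff of 0.7 only density is flagged. Lowering it to 0.6 also flags free sulfur dioxide, although that choice over total sulfur dioxide comes purely from the tie-breaking order.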
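To put a number on how weak the single-variable relationships are: for a linear regression with one predictor, R^2 equals the squared correlation coefficient. The sketch below squares the quality column of the correlation matrix printed earlier. Nothing is fitted here; it is plain arithmetic on the reported coefficients.

```r
# Correlation of each input with quality, copied from the matrix above.
r_quality <- c(
  fixed.acidity        = -0.113662831,
  volatile.acidity     = -0.194722969,
  citric.acid          = -0.009209091,
  residual.sugar       = -0.097576829,
  chlorides            = -0.209934411,
  free.sulfur.dioxide  =  0.008158067,
  total.sulfur.dioxide = -0.174737218,
  density              = -0.307123313,
  pH                   =  0.099427246,
  sulphates            =  0.053677877,
  alcohol              =  0.435574715
)

# Share of variance in quality that each input would explain on its own.
r_squared <- sort(r_quality^2, decreasing = TRUE)
round(r_squared, 3)
```

On these figures, alcohol alone would account for roughly 19% of the variance in quality and density for about 9%, with every other input below 5%.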
Because there is no discernible linear pattern between the input variables and the output variable (or a relationship that can be transformed to linear), unfortunately, I can’t fit a useful linear model to this dataset to predict wine quality from a combination of input factors. But I can make some assumptions as to what values of the input variables would likely be present in good quality wines:
To build a model of quality, a machine learning approach, such as support vector machines or a neural network, may also work.